Nature Medicine — Latest Matching Preprints

1

Human vs AI Clinical Assessment: Benchmarking a Multimodal Foundation Model Against Multi-Center Expert Judgment on the Mental Status Examination.

Mwangi, B.; Jabbar Abdl Sattar Hamoudi, H.; Sanches, M.; Dogan, N.; Chaudhary, P.; Wu, M.-J.; Zunta-Soares, G. B.; Soares, J. C.; Martin, A.; Soutullo, C. A.

2026-04-20 psychiatry and clinical psychology 10.64898/2026.04.17.26351105 medRxiv

Top 0.1%

52.1%

The Mental Status Examination (MSE) is the cornerstone of the psychiatric evaluation, yet validating artificial intelligence (AI) against the inherent variance of clinical judgment remains a critical bottleneck. Here we introduce a multi-center framework to benchmark the open-weight multimodal foundation model Qwen3-Omni against independent expert panels at two sites, UTHealth and Yale. Evaluating 396 classifications across 10 MSE domains and three longitudinal timepoints of increasing symptom severity, we found that experts achieved substantial agreement (Gwet's AC1 = 0.87), whereas the model achieved only moderate alignment (AC1 = 0.70-0.72). Even as the model's overall pathology prediction rate approximated the experts', the aggregate equilibrium masked a profound "clinical reasoning gap". Specifically, the model systematically over-predicted observable signs (e.g., speech, affect) while notably failing in inferential domains requiring the interpretation of latent mental content (e.g., delusions, perceptions). A 4-bit quantization analysis of the model confirmed this mechanistically: reducing model capacity disproportionately degraded inferential reasoning while preserving perceptual feature extraction. Furthermore, model-to-expert agreement degraded linearly as clinical complexity intensified across longitudinal visits (Accuracy: T0 = 84.8-87%; T1 = 80-82%; T2 = 71-73%), whereas expert consensus remained robust. Notably, model errors increased 2.3- to 3.4-fold where human experts disagreed. These findings establish inter-expert variance as an essential measurable baseline for psychiatric AI, demonstrating that true clinical translation requires models to move beyond multimodal perceptual extraction to achieve higher-order diagnostic reasoning.
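
For readers unfamiliar with the agreement statistic quoted in this abstract, the following is a minimal sketch of Gwet's AC1 for two raters and nominal categories. It is not the authors' code: the function name `gwet_ac1`, its arguments, and the toy ratings are illustrative assumptions, and the preprint may use a multi-rater variant.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories):
    """Gwet's AC1 for two raters (illustrative sketch, hypothetical helper).

    AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and
    p_e = 1/(Q-1) * sum_q pi_q * (1 - pi_q), with pi_q the average of the
    two raters' marginal proportions for category q.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")

    n = len(rater_a)
    # Observed agreement: share of items on which the raters coincide.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = 0.0
    for c in categories:
        pi = (counts_a[c] + counts_b[c]) / (2 * n)  # mean marginal proportion
        p_e += pi * (1 - pi)
    p_e /= q - 1  # chance agreement under Gwet's model

    return (p_a - p_e) / (1 - p_e)


# Toy data: 1 = pathology present, 0 = absent (made up for illustration).
expert = [1, 1, 1, 1, 0]
model = [1, 1, 1, 0, 0]
print(round(gwet_ac1(expert, model, categories=[0, 1]), 3))
```

Unlike Cohen's kappa, the chance term here stays small when one category dominates, which is one reason AC1 is often preferred for skewed clinical ratings.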

2

Solving Emergency Department Triage with Small Language Models

Belski, V.; Lukina, K.

2026-05-05 health policy 10.64898/2026.05.04.26352355 medRxiv

Top 0.1%

43.8%

Emergency department (ED) triage assigns patients a five-level Emergency Severity Index (ESI) score that determines care priority. We investigate the feasibility of automating this process, comparing large commercial models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, MedGemma) against a purpose-built pipeline combining a small extraction model with a deterministic clinical engine, and a 9B-parameter language model trained with structured chain-of-thought supervision and reinforcement learning. Off-the-shelf large models achieve only 45-55% exact ESI accuracy while being impractical for clinical deployment due to privacy constraints, cost, and latency. Our specialized BiomedBERT [4] pipeline achieves 88.9% exact accuracy with 97.2% adjacent accuracy (±1 ESI) on a 50-case expert-labeled evaluation set, approaching nurse inter-rater agreement. A Qwen3.5-9B model [16] fine-tuned with chain-of-thought supervision achieves 75.0% exact / 97.2% adjacent accuracy on a 36-case narrative evaluation. Ongoing GRPO training [13] with a clinically asymmetric reward function and 2,776 ESI-1 narrative training cases (previously 22, due to a discovered extraction bug) shows strong early reward signal. We document 37+ BERT experiments, multiple LLM training cycles, systematic data quality audits, and the specific engineering decisions that enabled progress, including the discovery that 71% of training labels for altered mental status were false positives.
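
The two headline metrics in this abstract, exact accuracy and adjacent (±1 ESI) accuracy, can be sketched as follows. This is an illustrative helper under stated assumptions, not the authors' evaluation code; the name `esi_accuracy` and the sample labels are hypothetical.

```python
def esi_accuracy(true_levels, predicted_levels, tolerance=1):
    """Return (exact, adjacent) accuracy for ESI levels 1-5.

    exact:    share of cases where prediction equals the reference label.
    adjacent: share of cases where prediction is within `tolerance` levels.
    Hypothetical helper written for illustration only.
    """
    if len(true_levels) != len(predicted_levels) or not true_levels:
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(true_levels, predicted_levels))
    exact = sum(t == p for t, p in pairs) / len(pairs)
    adjacent = sum(abs(t - p) <= tolerance for t, p in pairs) / len(pairs)
    return exact, adjacent


# Made-up labels: three exact matches, two off-by-one, one off-by-two.
reference = [1, 2, 3, 4, 5, 3]
predicted = [1, 3, 3, 2, 5, 4]
exact, adjacent = esi_accuracy(reference, predicted)
print(f"exact={exact:.3f} adjacent={adjacent:.3f}")
```

Adjacent accuracy treats over-triage and under-triage symmetrically; a clinically asymmetric score, as the abstract mentions for the reward function, would need separate handling of the two error directions.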

3

A Multimodal Framework for Organ- and Cell-Resolved Biological Aging and Longevity Intervention Discovery

Al Dajani, S. A.; Williams, J. R.; Fuentealba, M.; Zhai, T.; Furman, D.; Snyder, M.; Abudayyeh, O. O.; Gootenberg, J. S.; Gladyshev, V. N.

2026-05-12 geriatric medicine 10.64898/2026.05.08.26352759 medRxiv

Top 0.1%

32.1%

Aging is the primary driver of chronic disease and mortality, requiring comprehensive frameworks for quantification of aging and nomination of longevity interventions. We developed mAge (multimodal age), a biological aging framework that integrates plasma proteomics, wearables, and mortality hazard to predict biological age, intrinsic capacity, and mortality risk. By combining proteomic and wearable data in UK Biobank samples, mAge improves on unimodal baseline age prediction, reaching a test R^2 of 0.87 and a mean error of 2.3 years, and reduces unimodal baseline mortality prediction error by 21%. We further constructed organ- and cell type-specific biological clocks that quantify aging across 49 distinct subsystems, revealing that cardiac, immune, and intracellular protein signatures benefit most from wearable integration. By mapping data to FDA-approved drug targets, we identified interventions, such as GLP-1 receptor agonists, gabapentin, and ACE inhibitors, that are associated with lower overall and subsystem-specific proteomic age and mortality risk or are associated with longer time-to-death and later age-at-death in longitudinal and deceased cohorts. mAge establishes a scalable framework for nominating and validating personalized longevity interventions, bridging continuous digital monitoring with molecular aging diagnostics.

4

Integrative, and Scalable mental health phenotyping using a knowledge-graph-derived dual-metric framework

Sharma, A.; Bharadwaj, A.; Modi, S.; Ahuja, G.; Jain, A.; Kumar, K.

2026-03-16 psychiatry and clinical psychology 10.64898/2026.03.09.26347798 medRxiv

Top 0.1%

28.4%

Prevailing diagnostic instruments for anxiety and depression, though clinically indispensable, remain anchored to symptom-focused queries that assess patients directly about their affective states, while often neglecting the multidimensional architecture of daily living. Here, we introduce two complementary metrics, the Cognitive Attention Score (CAS) and C:ERR (Cognition-to-Emotional-Response Ratio), derived from yogic psychology and operationalized within a structured knowledge graph (Ceekr-KG) comprising 151,288 triples linking 354 discrete CAS levels, 26 continuous C:ERR values, and 80 clinical symptoms. Rather than interrogating disease phenotypes directly, these metrics are computed by capturing circadian, nutritional, and lifestyle factors that jointly regulate cognitive and emotional homeostasis. The hyperparameter-tuned Ceekr-KG model demonstrated high structural fidelity (Hits@1 = 97%, mean reciprocal rank = 0.98), substantially outperforming relation-preserving randomized controls, indicating that predictive performance arises from semantic structure rather than graph topology alone. CAS and C:ERR showed a strong positive association (Spearman's ρ = 0.787, p < 0.0001) but exhibited distinct distributional properties, with C:ERR displaying consistently stronger inverse correlations with symptom severity across domains (e.g., low energy: ρ = -0.85 versus -0.70 for CAS). Ordinal regression further showed that a combined CAS and C:ERR model outperformed either metric alone for most symptoms, indicating complementary and non-redundant contributions to clinical variance. Integration of Ceekr-KG into the independent Clinical Knowledge Graph improved predictive performance of widely used questionnaire-based assessment scales, demonstrating that yogic psychological frameworks encode clinically relevant semantic information.
Finally, longitudinal analysis of 249 individuals meeting predefined inclusion criteria (baseline CAS < 64 and ≥2 assessments) across three therapeutic programmes revealed a mean CAS increase of +11.45 points (p < 0.001) and substantial migration from lower to higher functional bands, establishing Ceekr-KG as a validated digital phenotype for scalable mental health assessment.
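
The link-prediction metrics quoted in this abstract (Hits@1 and mean reciprocal rank) are both computed from the rank the model assigns to the correct entity for each test triple. A minimal sketch follows; the function names and the example ranks are assumptions for illustration and do not come from the preprint.

```python
def hits_at_k(ranks, k=1):
    """Share of test triples whose correct entity is ranked within the top k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples (ranks start at 1)."""
    if not ranks or any(r < 1 for r in ranks):
        raise ValueError("ranks must be non-empty positive integers")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Made-up ranks of the true entity for five test triples.
ranks = [1, 1, 2, 1, 4]
print(hits_at_k(ranks, k=1))        # 3 of 5 triples ranked first
print(mean_reciprocal_rank(ranks))  # (1 + 1 + 0.5 + 1 + 0.25) / 5
```

Because a rank of 1 contributes exactly 1 to both metrics, MRR can never fall below Hits@1, which is consistent with the reported pair of 97% and 0.98.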

5

Artificial Intelligence Agents in Mental Health: A Systematic Review and Meta Analysis

Zhu, L.; Wang, W.; Liang, Z.; Tan, W.; Chen, B.; Lin, X.; Wu, Z.; Yu, H.; Li, X.; Jiao, J.; He, S.; Dai, G.; Niu, J.; Zhong, Y.; Hua, W.; Chan, N. Y.; Lu, L.; Wing, Y. K.; Ma, X.; Fan, L.

2026-04-22 psychiatry and clinical psychology 10.64898/2026.04.21.26351365 medRxiv

Top 0.1%

27.4%

The rapid rise of large language models (LLMs) and foundation models has accelerated efforts to build artificial intelligence (AI) agents for mental health assessment, triage, psychotherapy support and clinical decision assistance. Yet a gap persists between healthcare and AI-focused work: while both communities use the language of "agents," clinical research largely describes monolithic chatbots, whereas AI studies emphasize agentic properties such as autonomous planning, multiagent coordination, tool and database use and integration with multimodal mental health data streams. In this Review, we conduct a systematic analysis of mental health AI agent systems from 2023 to 2025 using a six-dimensional audit framework: (i) system type (base model lineage, interface modality and workflow composition, from rule-based tools to role-aware multi-agent foundation-model systems), (ii) data scope (modalities and provenance, from elicited self-report and chatbot dialogues to electronic health records, biosensing and synthetic corpora), (iii) mental health focus (mapped to ICD-11 diagnostic groupings), (iv) demographics (age strata, geography and sex representation), (v) downstream tasks (screening/triage, clinical decision support, therapeutic interventions, documentation, ethical-legal support and education/simulation) and (vi) evaluation types (automated metrics, language quality benchmarks, safety stress tests, expert review and clinician or patient involvement). 
Across this corpus, we find that most systems (1) concentrate on depression, anxiety and suicidality, with sparse coverage of severe mental illness, neurocognitive disorders, substance use and complex comorbidity; (2) rely heavily on text-based self-report rather than clinically verified longitudinal data or genuinely multimodal inputs; (3) are implemented as single-agent chatbots powered by general-purpose LLMs rather than role-structured, workflow-integrated pipelines; and (4) are evaluated primarily via offline metrics or vignette-based scenarios, with few prospective, clinician- or patient-in-the-loop studies. At the same time, an emerging class of agentic systems assigns foundation models explicit roles as planners, retrieval agents, safety auditors or supervisors coordinating other models and tools. These multiagent, tool-augmented workflows promise personalization, safety monitoring and greater transparency, but they also introduce new risks around reliability, bias amplification, privacy, regulatory accountability and the blurring of clinical versus non-clinical roles. We conclude by outlining priorities for the next generation of mental health AI agents: clinically grounded, role-aware multi-agent architectures; transparent and privacy-preserving use of clinical and elicited data; demographic and cultural broadening beyond predominantly Western adult samples; and evaluation pipelines that progress from offline benchmarks to longitudinal, real-world studies with routine safety auditing and clear governance of responsibilities between agents and human clinicians.

6

MedMisBench: Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

Zhou, H.; Zou, X.; Wu, J.; Wu, S.; Wu, J.; Segal, B. M.; Niebuhr, T. E.; Amro, S.; Petrus, M.; Momin, S.; Cardoso Pinto, A.; Niesen, R.; Wegner, L. S.; Darji, D.; Koo, J. M.; Fieggen, J.; Narain, K.; Zeng, M.; Clifton, L.; Shapiro, L.; Liu, F.; Clifton, D. A.

2026-05-28 bioengineering 10.64898/2026.05.25.727671 medRxiv

Top 0.1%

25.9%

Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.

7

Conserved neuroectodermal aging encodes primate health and longevity

Yang, S.; Xin, Z.; Wang, W.

2026-05-07 geriatric medicine 10.64898/2026.05.05.26352498 medRxiv

Top 0.1%

22.9%

Neuroectoderm-derived tissues are highly metabolically active and exhibit minimal regenerative turnover, rendering them uniquely vulnerable to age-related stress while preserving undiluted degenerative signals. Yet aging dynamics in these tissues remain elusive in living primates. Here, we introduce an in vivo neuroectodermal aging clock and trace its trajectory in 66,602 human adults and six rhesus macaques across nine health and disease cohorts using an in situ optical biopsy. Through a digital histology atlas integrated with artificial intelligence, we resolve tissue representations of neuroectodermal aging within the human retina, predominantly localized to the metabolically active ganglion and bipolar cell populations and the photoreceptor complex, while demonstrating their evolutionary conservation across primate species. Neuroectodermal aging predicts health and longevity, scales across space and time, and captures preclinical aging signals within and beyond the neuroectodermal compartment. This framework is further validated in a diabetic population, where robust prognostic and dynamic sensitivity are preserved across physiological and perturbed states. Our work establishes a scalable framework for resolving neuroectodermal aging in living primates and linking tissue-level vulnerability to systemic health trajectories.

8

A digital twin for hospital antimicrobial resistance forecasting and constrained intervention optimisation

Triantafyllidis, C. P.; Aguas, R.

2026-05-22 public and global health 10.64898/2026.05.15.26353296 medRxiv

Top 0.1%

22.6%

Hospital antimicrobial resistance (AMR) emanates from an array of complex interactions between patient turnover, heterogeneous patient-staff contact patterns, antibiotic-driven within-host selection, and imperfect surveillance. We present a hospital AMR digital twin that combines mechanistic simulation with temporal graph learning to forecast resistance emergence from evolving daily contact networks and to support intervention planning. Our approach is twofold: graph neural networks and transformers to model predictions and mathematical programming optimization to provide decision support. The main predictive task asks whether future spread of resistant infections is more likely to be driven by endogenous hospital transmission and selection or by importation on admission. We evaluated this task under both fully observed and partially observed settings, using baseline benchmarks together with ablations, surveillance perturbations, and distribution-shift stress tests. Under canonical conditions, the model achieved very strong predictive performance, especially when ground-truth system states were available, and remained informative under partial observation. Ablations showed that contact-weight information was relatively robust, whereas compressed node-feature representations weakened performance more noticeably when observations were incomplete. Surveillance stress tests further showed that delayed or less frequent reporting can be tolerated in some settings, but threshold calibration becomes fragile under more severe observation changes. Across broader epidemiological and surveillance shifts, the ground-truth model generally preserved strong ranking ability, while partial-observation performance was less stable.
When models were trained directly in the shifted regime, performance improved compared with zero-shot transfer, indicating that the digital twin can adapt to new and previously unseen operating conditions but that portability across regimes can still be improved, particularly when only partial surveillance data are available. We also evaluated intervention-conditioned forecasting by branching hospital states into a small library of screening and isolation policies under a shock-and-superspreader regime. The learned models supported useful within-state action ranking and frequently identified policies that improved on the baseline containment protocols or avoided worsening outcomes. The same digital twin can also support constrained intervention selection, although reliable deployment will require careful calibration, improved robustness under partial observation, and broader policy libraries.

9

NeuroFM: Toward Precision Neuroimaging with Foundation Models for Individualized Brain Health Estimation

Dibble, A.; Dalby, C.; Sevegnani, M.; Fracasso, A.; Lyall, D. M.; Harvey, M.; Svanera, M.

2026-03-31 neurology 10.64898/2026.03.27.26349489 medRxiv

Top 0.1%

22.5%

Precision neuroimaging aims to deliver individualized assessments of brain health, yet a single structural MRI does not yield a multidimensional, quantitative summary of an individual's current health or future risk. Existing approaches optimize task-specific objectives, yielding representations entangled with cohort- or disease-specific signals rather than capturing biologically grounded patterns of anatomical variation. Here, we introduce NeuroFM, a foundation model trained exclusively on 100,000 healthy synthetic volumes to predict morphometric and demographic targets. Without exposure to diagnostic labels, NeuroFM organizes brain MRIs into population-level patterns that encode meaningful brain health differences. These representations transfer across five neuroscience domains without adaptation and support simple linear readouts for clinical, cognitive, developmental, socio-behavioural, and image quality control. Evaluated on 136,361 real volumes spanning multiple cohorts, NeuroFM generalizes across domains and enables individual-level brain health profiling, estimating future dementia risk years before diagnosis. Together, these findings establish a disease-naive foundation model paradigm for precision neuroimaging.

10

BETA: Resting-state fMRI Biotypes for tDCS Efficacy in Anxiety Among Older Adults At Risk For Alzheimer's Disease

Stolte, S. E.; Cheng, J.; Acharya, C.; Gu, L.; O'Shea, A.; Indahlastari, A.; Woods, A. J.; Fang, R.

2026-04-27 neurology 10.64898/2026.04.24.26351493 medRxiv

Top 0.1%

22.4%

Anxiety is usually gauged by self-report, yet a single symptom level can reflect disparate neural circuitry. In Alzheimer's disease and related dementias (ADRD) this heterogeneity becomes a barrier to effective neuromodulation: some patients may benefit from transcranial direct-current stimulation (tDCS), while others may not. To overcome this obstacle, we introduced BETA (Biotypes for tDCS Efficacy in Anxiety), a data-driven pipeline that uses resting-state fMRI functional connectivity to derive anxiety subtypes that are intrinsically linked to tDCS response. A transformer-based variational autoencoder compresses high-dimensional connectivity into a 50-dimensional latent embedding that emphasizes networks implicated in cognitive aging and anxiety. A deep-embedded clustering loss, regularized by a clinically informed term that pulls together individuals who exhibit similar post-tDCS anxiety change, yields four distinct subtypes. Across all subtypes, disrupted coupling between sensory-processing and higher-order cognitive regions emerges as a common hallmark. Crucially, one cluster is resistant to frontal-lobe tDCS, whereas two clusters demonstrate significant anxiety reduction following stimulation. The responsive subtypes are defined by strengthened connectivity between the lateral occipital cortex-superior division (sLOC) and medial frontal cortex (MedFC), and between sLOC and the intracalcarine cortex (ICC). BETA demonstrates that fMRI-based subtyping can directly identify which patients are likely to benefit from tDCS, providing a concrete roadmap for precision psychiatry in ADRD and facilitating tailored therapeutic strategies for anxiety.

11

Domain-adapted language model using reinforcement learning for various dementias

Kowshik, S. S.; Jasodanand, V. H.; Bellitti, M.; Puducheri, S.; Xu, L.; Liu, Y.; Saichandran, K. S.; Dwyer, B. C.; Gabelle, A.; Hao, H.; Kedar, S.; Murman, D. L.; O'Shea, S.; Saint-Hilaire, M.-H.; Samudra, N. P.; Sartor, E. A.; Swaminathan, A.; Taraschenko, O.; Yuan, J.; Au, R.; Kolachalama, V. B.

2026-03-23 neurology 10.64898/2026.03.17.26348154 medRxiv

Top 0.1%

22.1%

Large language models excel at processing complex clinical data and advanced reasoning, yet domain-specific adaptation is essential to realize their full potential in fields such as Alzheimer's disease and related dementias (ADRD). Here, we present a generative language model for ADRD fine-tuned via reinforcement learning with verifiable rewards using a self-certainty-aware advantage. Model development and validation leveraged data from five ADRD cohorts, totaling 54,535 participants. Our framework integrates demographics, personal and family medical histories, medication use, neuropsychological test results, functional assessments, physical and neurological examination findings, laboratory data and multimodal neuroimaging to construct comprehensive clinical profiles. On held-out testing data involving 36,688 participants, our model achieved robust performance on syndromic classification, primary etiological diagnosis and biomarker prediction. Model predictions were validated against postmortem-confirmed diagnoses, and clinical utility was demonstrated in a controlled within-subjects crossover study where board-certified neurologists reviewed cases with and without model assistance, showing that exposure to model responses improved diagnostic performance. These results demonstrate that targeted domain adaptation with reinforcement learning can enable language models to deliver accurate, reasoning-driven support in ADRD evaluation. Prospective validation will be essential to translate these advances into improved patient outcomes.

12

Genomic characterization of the 2024/2025 Mpox outbreak in Uganda

Kanyerezi, S.; Ayitewala, A.; Nsawotebba, A.; Makoha, C.; Tusabe, G.; Kabahita, J. M.; Oundo, H. R.; Seruyange, J.; Tenywa, W.; Were, S.; Murungi, M.; Nakintu, V.; Sserwadda, I.; Onywera, H.; Tanui, C.; Mugerwa, I.; Kagirita, A.; Lubwama, B.; Michael, E. R.; Kateete, D. P.; Otita, M.; Giduddu, S.; Jjingo, D.; Mboowa, G.; Ssemaganda, A.; Nabadda, S.; Tessema, S. K.; Ssewanyana, I.

2026-03-17 public and global health 10.64898/2026.03.16.26348494 medRxiv

Top 0.1%

21.8%

Mpox has historically been endemic in Central and West Africa, driven by recurrent zoonotic spillover events, but recent outbreaks in East Africa underscore its expanding geographic footprint. Despite this shift, genomic data from East Africa remain limited. We performed genomic characterization of the 2024/2025 Mpox outbreak in Uganda using PCR-confirmed monkeypox virus (MPXV) positive samples (n=511) from 44 districts, all achieving ≥70% genome coverage. To provide regional context, we incorporated 895 publicly available clade Ib MPXV genomes from GISAID, Pathoplexus, and NCBI. Phylogenetic analysis revealed two major clusters within clade Ib, each subdivided into two subclusters, indicating substantial viral diversification. Most Ugandan sequences clustered within the most genetically diverse subcluster. Additional Ugandan genomes were distributed across other subclusters, indicating co-circulation of multiple lineages. Cluster 1 was dominated by sequences from the Democratic Republic of Congo, while phylogeographic analysis identified multiple cross-border introductions into Uganda. These findings highlight the role of regional connectivity in shaping MPXV transmission and underscore the importance of integrated genomic surveillance and cross-border data sharing to inform outbreak response in East and Central Africa.

13

Stabilized gp120-specific CD4 for next-generation HIV-1 inhibitors

Bahn-Suh, A. J.; Caldera, L. F.; Gnanapragasam, P. N. P.; Keeffe, J. R.; Seaman, M. S.; Bjorkman, P. J.; Mayo, S. L.

2026-03-27 bioengineering 10.64898/2026.03.24.713825 medRxiv

Top 0.1%

19.3%

HIV-1 Env's gp120 subunit uses the T-cell coreceptor CD4 to enter host cells in a manner that prevents the evolution of host resistance by sharing the binding epitope with the footprint of CD4's natural ligands, class II MHC proteins [1,2]. Consequently, CD4-containing biologics, such as CD4-Ig [3,4] and derivatives [5-9], benefit from this conserved relationship and are promising broad-acting anti-HIV-1 agents that are resistant to viral mutational escape [10]. However, these biologics suffer from short serum half-lives in humans [11,12] and animals [3,13], likely due to CD4's poor thermostability [14] and/or off-target class II MHC binding [15]. This latter property also warrants caution for CD4-containing biologics that could indiscriminately recruit Fc-dependent effector functions against uninfected cells and/or compete with host CD4 for class II MHC during T cell interactions with antigen-presenting cells. Here, we describe gp120-specific CD4 (gCD4), which exhibits enhanced thermostability and retains Env, but not class II MHC, binding. CD4-Ig variants incorporating gCD4 did not bind class II MHC on human B cells, displayed greater longevity in human tonsil organoid cultures, showed half-lives equivalent to therapeutic IgG antibodies in mice, and neutralized HIV-1 more broadly and potently compared to the original CD4-Ig molecules. Encouragingly, one variant neutralized 100% of a panel of clinically-relevant HIV-1 strains at titers correlating to infection prevention in humans, outperforming known broadly neutralizing antibodies [16,17]. Thus, gCD4 holds promise for the development of new CD4-containing biologics with best-in-class specificity, pharmacokinetic properties, and neutralization breadth and potency.

14

Evaluating Large Language Models for Assessment of Psychosis Risk

Zhu, T.; Tashevski, A.; Taquet, M.; Azis, M.; Jani, T.; Broome, M. R.; Kabir, T.; Minichino, A.; Murray, G. K.; Nour, M. M.; Singh, I.; Fusar-Poli, P.; Nevado-Holgado, A.; McGuire, P.; Oliver, D.

2026-04-04 psychiatry and clinical psychology 10.64898/2026.04.02.26349960 medRxiv

Top 0.1%

19.2%

Psychosis prevention relies on early detection of individuals at clinical high risk for psychosis (CHR-P), yet such detection remains limited, constraining preventive care. The effectiveness of the CHR-P state is constrained in part because clinical assessments require specialist interpretation of narrative interviews, limiting scalability. Here, we evaluate whether large language models (LLMs; deep learning models trained on large text corpora to process and generate language) can extract clinically meaningful information from such interviews to support psychosis risk assessment. We assessed 11 open-weight LLMs on 678 PSYCHS interview transcripts from 373 participants (77.7% CHR-P). Models inferred CHR-P status and estimated severity and frequency across 15 symptom domains, benchmarked against researcher-rated scores. Larger models achieved the strongest classification performance (Llama-3.3-70B: accuracy = 0.80, sensitivity = 0.93, specificity = 0.58). LLM-generated symptom scores showed good correlations with researcher-rated scores (ICC_sev = 0.74, ICC_freq = 0.75). Performance disparities were minimal across most demographic groups but varied across sites. Generated summaries were largely faithful to source transcripts, with low rates of clinically relevant confabulation (3%). Errors primarily reflected over-pathologisation of non-clinical experiences. While accuracy scaled with model size, smaller models achieved competitive performance with substantially lower computational cost. These findings demonstrate that open-weight LLMs can assess psychosis risk from clinical interview transcripts, supporting scalable, human-in-the-loop approaches to early detection.
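
The classification figures reported in this abstract (accuracy, sensitivity, specificity) all derive from a two-by-two confusion table. The sketch below shows the arithmetic on made-up labels; `binary_metrics` is a hypothetical helper, and the coding of 1 as CHR-P is an assumption for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity, and specificity for binary labels.

    Hypothetical helper for illustration; `positive` marks the class treated
    as the condition of interest (here assumed to be CHR-P).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")

    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Made-up labels: 1 = CHR-P, 0 = not CHR-P.
print(binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```

High sensitivity paired with lower specificity, as reported for the largest model, corresponds to many false positives in this table, which matches the abstract's note that errors mainly reflect over-pathologisation.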

15

Peer support boosted Hepatitis C treatment access among marginalised populations in England: A Bayesian causal factor analysis.

Schmidt, C.; Samartsidis, P.; Seaman, S.; Emmanouil, B.; Foster, G.; Reid, L.; Smith, S.; De Angelis, D.

2026-04-22 health policy 10.64898/2026.04.20.26351261 medRxiv

Top 0.1%

18.6%

To minimise health disparities, equitable access to medical treatment is paramount. In a pioneering intervention, National Health Service England's Hepatitis C virus (HCV) programme has implemented country-wide peer support to boost treatment access. Peer support workers (peers) are individuals with relevant lived experience, who promote testing and treatment in marginalised populations underserved by traditional health services. We evaluated the English peers intervention, exploiting its staggered rollout and rich surveillance data between June 2016 and May 2021. Peers increased HCV cases identified by 13.9% (95% credible interval (95% CrI) [5.3, 21.7]), sustained viral responses by 8.0% (95% CrI [-4.4, 18.6]), and drug services referrals by 8.8% (95% CrI [-12.5, 22.6]). The intervention's effectiveness was magnified during the first COVID-19 lockdown and individuals supported by peers typically belonged to populations with poor treatment access. Our findings indicate that peers can boost equity in treatment access on a national scale.

16

Hierarchical organ aging signatures from routine abdominal CT add incremental disease risk stratification beyond blood biomarkers

Deng, Z.; Wang, Y.; Shi, Y.; Wang, L.; Qureshi, T. A.; Gaddam, S.; Javed, S.; Hsu, Y.-C.; De Righi, D. R.; Azab, L.; Diwan, G.; Yang, J. D.; Xie, Y.; Yuan, C.; Vendrami, C. L.; Rodriguez, A.; Specht, K.; Jeon, C. Y.; Chaudhry, H.; Buxbaum, J.; Pisegna, J. R.; Yaghmai, V.; Goessling, W.; Hernandez-Barco, Y. G.; Miller, F. H.; Tirkes, T.; Espinoza, S.; Musi, N.; Dey, D.; Sung, K. H.; Pandol, S. J.; Li, D.

2026-05-27 radiology and imaging 10.64898/2026.05.19.26353206 medRxiv

Top 0.1%

18.5%

Biological aging is heterogeneous across organ systems, yet whether CT-derived abdominal aging provides prognostic value beyond routine clinical data, and whether organ decomposition adds value beyond a unified estimate, remain untested. We developed and evaluated organ-specific and ensemble biological age models from radiomic features across five abdominal organs in 68,675 CT scans from 32,883 subjects, evaluated on alignment with chronological age of healthy subjects (nested cross validation: MAE=3.68 years, R^2=0.90). In sequential analyses restricted to adults aged 20-60 years, the stratum with the strongest association between biological age gap (BAG) and disease, ensemble biological age gaps provided incremental prognostic value beyond demographic covariates for all-cause disease and mortality (ΔC-index=0.141 and 0.051, respectively) and beyond routine blood biomarkers (ΔC-index=0.048), confirming that CT-derived aging captures structural information beyond laboratory markers. Organ-specific biological age added incremental prognostic value beyond the ensemble selectively for focal diseases: cardiovascular (aorta, ΔC-index=0.091) and hepato-pancreatic (pancreas, ΔC-index=0.096). These findings establish a hierarchical organization of CT-derived biological aging, positioning routine CT as a source that adds prognostic value to existing clinical biomarkers.

17

The Evolutionary Dynamics and Regional Spread of Mpox in Africa: Insights from Multi-country Genomic Surveillance

Tanui, C. K.; Kinganda-Lusamaki, E.; O'Toole, A.; Chitenje, M.; Campbell, A. K. O.; DIAGNE, M. M.; Kanyerezi, S.; Faye, M.; Ifabumuyi, S. O.; Nzoyikorera, N.; Lango, H. O.; Koukouikila-Koussounda, F.; Meite, S.; Sikazwe, E.; Djuicy, D. D.; Adu, B.; MAMAN, I.; Mapunda, L. A.; Nyan, D. C.; Stephane, S.; Aricha, S. A.; Cherif Gnimadi, T. A.; Maror, J. A.; Pereira, A. M.; Atrah, Y. S.; Akanbi, O. A.; Lokilo, E. L.; Makangara-Cigolo, J.-C.; Paku, P. T.; Luakanda, G. N.; Amuri-Aziza, A.; Wawina-Bokalanga, T.; Mugerwa, I.; Nsawotebba, A.; Ayitewala, A.; Williams, A. J.; Folorunso, V.; Mani, S.; Hardi

2026-04-11 infectious diseases 10.64898/2026.04.07.26347884 medRxiv

Top 0.1%

18.3%

The recent MPXV epidemic across Africa revealed extensive viral diversity and complex transmission dynamics, prompting a continent-wide genomic investigation. We analysed 3,450 high-quality MPXV whole genomes from 24 African Union Member States, revealing the complex and concurrent circulation of subclades Ia, Ib, IIa, and IIb. Subclade Ia showed high levels of virus diversity in reservoir hosts in Central Africa, detected through zoonotic transmission and some recently detected sustained human outbreaks. In contrast, subclade Ib exhibited signatures of sustained human-to-human transmission across Eastern and Southern Africa. Subclade IIa remains largely zoonotic in West Africa. Like Ia, IIb shows continued zoonotic transmission, as well as sustained human outbreaks linked to the circulation of lineages G1 and G2. Phylogeographic analyses revealed frequent cross-border transmission and interconnectedness, which aligned with both human mobility corridors and international boundaries. For instance, the Democratic Republic of the Congo and Sierra Leone appear to be emerging as sources of regional exportation, while the Cameroon-Nigeria, CAR-Cameroon, and CAR-DRC interfaces reflected ongoing cross-border zoonotic spillovers. These findings underscore the need for harmonised genomic surveillance, APOBEC3-aware triage, and integrated One Health strategies to prevent local outbreaks from escalating into regional epidemics and to inform vaccine deployment and public health preparedness.

18

A Cerebral Frailty Risk Score Integrating Frailty Index and Neuroimaging for Dementia Prediction in the UK Biobank

Kan, C. N.; Chew, J.; Lim, W. S.; Tan, C. H.

2026-04-04 geriatric medicine 10.64898/2026.04.01.26350015 medRxiv

Top 0.1%

18.3%

Frailty is a multisystem clinical syndrome closely linked to cognitive aging, yet its cerebral underpinnings and co-contribution to adverse outcomes remain poorly understood. In 63,509 dementia-free UK Biobank participants (aged 65.0±7.7), higher frailty index (FI) was associated with multiple neuroimaging markers, including reduced hippocampal volume, decreased cortical thickness, greater white matter hyperintensity burden, and impaired brain diffusion metrics. FI and neuroimaging markers additively increased the risks of incident dementia and mortality. An extreme gradient boosting with accelerated failure time framework highlighted FI and key regional neuroimaging features in dementia risk prediction (nested C-index=0.825, iAUC=0.759). Integrating the top 10 predictors into a novel point-based cerebral frailty risk score (CFRS) showed strong performance in predicting dementia onset (optimism-corrected C-index=0.838, iAUC=0.778), and was robust to the competing risk of mortality. These findings highlight the potential utility of a CFRS framework that integrates cumulative systemic and cerebral vulnerabilities for dementia risk stratification.

19

Real-World Validation of Machine Learning Models for HIV Treatment Adherence Prediction and Care Gap Quantification: A Multi-Country Analysis of 192,732 Clinical Records

Chinthala, L. K.

2026-05-19 hiv aids 10.64898/2026.05.15.26353325 medRxiv

Top 0.1%

18.1%

Delayed diagnosis and poor antiretroviral therapy (ART) adherence remain primary drivers of HIV-related morbidity in low-resource settings, yet real-world AI validation at scale is lacking. We conducted a retrospective validation study using two publicly available, de-identified datasets: a Quality of Care cohort of 27,288 HIV-positive patients on ART across multiple healthcare facilities, and the CEPHIA multi-country assay database comprising 165,444 specimen records from six countries. Four machine learning classifiers were evaluated using 10-fold stratified cross-validation with SMOTE applied strictly to training folds. Explicit data leakage prevention, ablation analysis, calibration assessment, and bootstrap confidence intervals were applied. Economic projections used one-way sensitivity analysis. This study adheres to TRIPOD reporting guidelines. Random Forest achieved AUC-ROC of 0.9753 (95% CI: 0.970-0.975), sensitivity 87.3% (95% CI: 86.4-88.2%), specificity 95.7% (95% CI: 95.2-96.2%), and Brier score 0.079. Ablation testing confirmed robustness (AUC 0.963 without the primary predictor). Temporal validation on held-out future patients yielded AUC 0.772 (95% CI: 0.744-0.802), confirming generalisation across time. Real-world analysis revealed median diagnosis-to-ART delay of 74 days, with 47.3% of patients exceeding 90 days and 36.7% presenting with CD4 below 200 cells per microlitre. Multi-country CEPHIA analysis identified 18.6% HIV recency within the 130-day early-intervention window. Decision curve analysis confirmed net clinical benefit across threshold probabilities 0.03-0.45. Subgroup analysis demonstrated consistent AUC across sex, age, CD4 strata, and WHO staging (max difference 0.051). Economic modelling projected base-case savings of USD 415 per patient (USD 2.07 million per 5,000-patient cohort). 
These findings provide large-scale empirical evidence that AI-driven informatics can predict ART adherence failure and quantify systemic care gaps, offering a scalable framework for equitable HIV care delivery in resource-limited settings. Prospective external validation is required before clinical deployment.

20

Greater lean-body-mass decline with tirzepatide than semaglutide in routine care, revealed by body-composition digital phenotyping

Murugadoss, K.; Venkatakrishnan, A.; Soundararajan, V.

2026-04-13 endocrinology 10.64898/2026.04.11.26350687 medRxiv

Top 0.1%

18.0%

GLP-1 receptor agonists induce substantial weight loss, but the extent to which lean tissue and physical function are preserved in routine care remains poorly understood. Using an EHR-linked body-composition digital phenotyping pipeline with LLM-based extraction, we performed a large-scale longitudinal analysis of 670,422 first-episode GLP-1RA users, including 456,742 treated with semaglutide and 213,680 treated with tirzepatide. Among these, 7,965 individuals with paired pre- and post-initiation body-composition measurements were analyzed over 12 months. Tirzepatide was associated with greater relative lean body mass (LBM) loss than semaglutide at each measured time point, with excess LBM losses of 1.1%, 1.5%, 1.3% and 2% at 3, 6, 9 and 12 months, respectively. A Depletive GLP-1 metabotype, defined as >20% total body weight (TBW) loss with >5% LBM loss, was significantly more frequent with tirzepatide than semaglutide during the first year of therapy (10.3% versus 6.7%, p<0.001). By contrast, a Prime GLP-1 metabotype, defined as >10% TBW loss with <5% LBM loss, was numerically more frequent with semaglutide than tirzepatide, but not significantly so (12.3% versus 11.8%, p=0.66). Higher drug dose and longer exposure were associated with progressively greater LBM decline in both treatment groups (both p<0.001). Among 3,746 examined EHR phenotypes, baseline musculoskeletal pain emerged as the most significant correlate of greater LBM loss (BH-adjusted q<0.001): cervicalgia (semaglutide, -4.1 percentage points; tirzepatide, -14.3 percentage points) and knee pain (semaglutide, -4.8 percentage points; tirzepatide, -13.4 percentage points), consistent with mobility-limited patients being more vulnerable to lean-tissue depletion during incretin therapy. 
Analysis of EHR notes for on-treatment functional features showed that reduced exercise tolerance was the strongest correlate of greater LBM loss, increasing by 7.2 and 11.1 percentage points in semaglutide- and tirzepatide-treated patients, respectively. An independent analysis of all available single-cell RNA-seq data from human musculature showed a broader distribution of GIPR+ cells than GLP1R+ cells across immune, stromal, vascular, and contractile compartments, providing plausible biological context for the greater LBM loss observed in routine care with tirzepatide (dual GLP1R-GIPR agonist) relative to semaglutide (GLP1R-specific agonist). In this observational study, greater weight-loss efficacy did not necessarily translate into more favorable body-composition outcomes, underscoring the need for clinical decision-making and trial designs that maximize each patient's likelihood of achieving a Prime GLP-1 metabotype.
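The two metabotype definitions quoted in the abstract can be written as a simple threshold rule. This is a minimal sketch, assuming both inputs are percentage losses relative to baseline expressed as positive numbers. The function name `classify_metabotype` and the "Unclassified" label are hypothetical, and the abstract does not say how values exactly at a threshold are handled, so they fall through to "Unclassified" here.

```python
def classify_metabotype(tbw_loss_pct: float, lbm_loss_pct: float) -> str:
    """Label a patient using the thresholds stated in the abstract.

    tbw_loss_pct: total body weight loss, % of baseline (positive = loss)
    lbm_loss_pct: lean body mass loss, % of baseline (positive = loss)
    """
    # Depletive: >20% TBW loss together with >5% LBM loss.
    if tbw_loss_pct > 20 and lbm_loss_pct > 5:
        return "Depletive"
    # Prime: >10% TBW loss together with <5% LBM loss.
    if tbw_loss_pct > 10 and lbm_loss_pct < 5:
        return "Prime"
    # Everything else, including exact-threshold values (an assumption).
    return "Unclassified"
```

For example, under these thresholds `classify_metabotype(22, 6)` yields "Depletive" and `classify_metabotype(12, 3)` yields "Prime". A patient with more than 20% weight loss but less than 5% lean-mass loss also satisfies the Prime definition as stated.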